[Frontend] MLIR codegen on Python bindings: port C++ passes + affine-only DMA #259
Merged
98b4f0f to 689e2c0
Run the MLIR -> LLVM pipeline in-process via the bindings PassManager. Add Python out-of-line passes: lower_to_llvm, lower_dma_to_gemmini, lower_vlane_idx. Auto-resolve the bindings path from TORCHSIM_LLVM_PATH and ship the bindings in the LLVM artifact. Add op_coverage tooling and the bindings smoke test. Bump the LLVM pin and rebuild the thirdparty base image.
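A minimal sketch of what "auto-resolve the bindings path from TORCHSIM_LLVM_PATH" could look like. The candidate sub-directories and both function names are assumptions for illustration; the commit message does not say where the artifact places the bindings.

```python
import os
import sys
from pathlib import Path

# Hypothetical sub-paths where an LLVM tree might keep the MLIR Python package.
# These are illustrative guesses, not the project's actual search list.
CANDIDATE_SUBDIRS = (
    "python_packages/mlir_core",
    "tools/mlir/python_packages/mlir_core",
)


def resolve_bindings_path(env=None):
    """Return the first candidate under TORCHSIM_LLVM_PATH that holds an `mlir` package."""
    env = os.environ if env is None else env
    root = env.get("TORCHSIM_LLVM_PATH")
    if not root:
        return None
    for sub in CANDIDATE_SUBDIRS:
        candidate = Path(root) / sub
        if (candidate / "mlir").is_dir():
            return candidate
    return None


def ensure_bindings_on_path(env=None):
    """Prepend the resolved directory to sys.path so `import mlir` can find it."""
    path = resolve_bindings_path(env)
    if path is not None and str(path) not in sys.path:
        sys.path.insert(0, str(path))
    return path
```

Returning None rather than raising leaves the caller free to fall back to an already-importable `mlir` package.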
Emit togsim.transfer for >4D DMA and decompose it to a <=4D customized memref.dma_start: unit-collapse fast path, unrolled-subview peel for >4 effective dims. Fix #258 by emitting affine.apply (not arith.addi) for the peeled DRAM offset so the TOG pass can walk the loop index through it.
Split a loop axis so aligned FloorDiv/ModularIndexing collapse to per-axis affine indices. Mixed-radix split over a divisibility chain; integer-typed split symbols; r-prefix innermost reduce dims. Reindex the collapsed LoopBody instead of re-tracing; fold residual floor/mod via tensor range info. Shared boundary helpers, rank guard, and an uncovered floor/mod ledger. Enabled by default with the recompile fallback instrumented.
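The mixed-radix idea can be illustrated in plain Python. This is a sketch, not the pass itself: `split_plan` and `digit_of` are hypothetical names, and `digit_of(i, w, r)` follows the `(i // w) % r` meaning of ModularIndexing. When the divisors form a divisibility chain, every floor/mod term of the original index equals exactly one digit of the split axis, so the access becomes per-axis affine.

```python
from itertools import product


def split_plan(extent, boundaries):
    """Return (radices, weights) for the chain 1 | b1 | b2 | ... | extent.

    Digit k has radix chain[k+1] // chain[k] and weight chain[k]; digit 0 is innermost.
    """
    chain = [1] + sorted(set(boundaries) - {1, extent}) + [extent]
    for lo, hi in zip(chain, chain[1:]):
        if hi % lo != 0:
            raise ValueError(f"{lo} does not divide {hi}: not a divisibility chain")
    radices = [hi // lo for lo, hi in zip(chain, chain[1:])]
    return radices, chain[:-1]


def digit_of(i, weight, radix):
    """The floor/mod term (i // weight) % radix that one split digit replaces."""
    return (i // weight) % radix


def check_split(extent, boundaries):
    """Exhaustively confirm that each floor/mod term equals its split digit."""
    radices, weights = split_plan(extent, boundaries)
    for digits in product(*(range(r) for r in radices)):
        i = sum(d * w for d, w in zip(digits, weights))
        for d, w, r in zip(digits, weights, radices):
            if digit_of(i, w, r) != d:
                return False
    return True
```

For an axis of extent 24 accessed through divisors 4 and 12, `split_plan(24, [4, 12])` gives radices `[4, 3, 2]` with weights `[1, 4, 12]`. Divisors such as 4 and 6 do not form a chain and are rejected, which corresponds to the incompatible-radix case that axis-split cannot linearize.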
… floor/mod
Insert a copy to relayout an operand whose floor/mod cannot be removed by axis-split: incompatible-radix shared-axis access and cross-axis multi-variable arguments. Enabled by default alongside axis-split.
Port the analysis and IR-mutation halves of the C++ test-tile-operation-graph pass to Python, wire build_tog into the gem5 path, and drop the C++ pass. Node-id counter is thread-local for concurrent compilation.
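A thread-local counter of the kind described can be sketched with `threading.local`. The names `next_node_id` and `reset_node_ids` are hypothetical, not taken from the port.

```python
import threading

# Each thread sees its own attributes on this object, so concurrent
# compilations never share or race on the counter.
_state = threading.local()


def next_node_id():
    """Return a fresh id; every thread counts independently from 0."""
    current = getattr(_state, "counter", 0)
    _state.counter = current + 1
    return current


def reset_node_ids():
    """Restart numbering for the calling thread only."""
    _state.counter = 0
```

With a module-level integer, two kernels compiled at the same time would interleave ids. The thread-local version keeps numbering dense within each compilation, as long as one compilation stays on one thread.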
…seek seed
Add tests/ops/view/test_floormod_axis_split.py covering axis-split and graph-copy patterns. Seed the global RNG in the deepseek base test so config-random weights are deterministic.
a127c37 to 3c56c6b
…cal SRAM offset
Rewrite the >4D peel to mirror the C++ -dma-fine-grained subtile loop: wrap the outer dims in an affine.for nest (marked inner_loop so build_tog/TOG registers the induction var) and emit one <=4D memref.dma_start per iteration.
The slice SRAM offset is the lane-banked physical offset -- split-outer dims rescaled by the lane coeff (stride/old_size*new_size, the MVIN block_stride / buildSramAffineMap rule) -- delivered as the last SRAM index operand. The previous unrolled subview carried the offset in the subview, which extract_aligned_pointer_as_index strips in the gemmini lowering, so every slice aliased the same spad location (pixel_shuffle MISMATCH).
The DRAM offset folds with the original index into one affine.apply so processDramIndices can walk the loop index (#258).
Thread vectorlane (systolic-array size) through run_python_passes into the pass for the rescale's nr_outerloop. Drop the axis-split rank guard now that >4D is peeled correctly, and register tests/ops/view/test_floormod_axis_split.py in the CI allowlist.
Validated end-to-end (Gem5+Spike+TOGSim): pixel_shuffle (>4D peel) and the full floor/mod suite pass; elementwise/gemm/conv2d/reduce/softmax/MLP regress clean.
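The DRAM side of the >4D peel can be sketched as follows. This is a simplified model under stated assumptions: `peel_transfer` is a hypothetical name, the leading dims are taken as the outer ones, and the lane-banked SRAM offset is left out. The real pass emits an affine.for nest rather than a Python loop.

```python
from itertools import product

MAX_DMA_RANK = 4  # the <=4D descriptor limit named in the commit message


def peel_transfer(shape, strides, base_offset=0):
    """Yield (dram_offset, inner_shape, inner_strides), one per outer iteration.

    Assumption: the leading len(shape) - 4 dims are peeled into the loop nest
    and the trailing (up to) 4 dims form each descriptor.
    """
    if len(shape) != len(strides):
        raise ValueError("shape and strides must have the same rank")
    n_outer = max(0, len(shape) - MAX_DMA_RANK)
    outer_shape, inner_shape = shape[:n_outer], shape[n_outer:]
    outer_strides, inner_strides = strides[:n_outer], strides[n_outer:]
    # product() over no ranges yields a single empty index, so a <=4D
    # transfer produces exactly one descriptor.
    for idx in product(*(range(s) for s in outer_shape)):
        offset = base_offset + sum(i * s for i, s in zip(idx, outer_strides))
        yield offset, tuple(inner_shape), tuple(inner_strides)
```

The offset stays a linear function of the outer loop indices, which is the property the commit relies on when it folds the DRAM offset into a single affine.apply.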
Route every MVIN/MVOUT -- both the MLIRKernel load/store backend path and the template path (gemm/conv/bmm/maxpool/cat) -- through emit_transfer, so a single decompose-transfer pass lowers all DMAs to memref.dma_start. This drops the get_dma_code emitter, the _dma_needs_transfer instance flag, and format_dma_op_attributes.
togsim.transfer now also carries subtile_size and async, which decompose propagates onto the lowered dma_start (subtile filtered to the kept axes when unit dims collapse). For <=4D tiles decompose emits the descriptor directly on the original SRAM buffer (no collapse_shape) so the C++ -dma-fine-grained subtile split, which walks the SRAM operand, sees a direct buffer as before.
Validated end-to-end (Spike + TOGSim) on elementwise, gemm (matmul/addmm), bmm, conv2d, group_conv, pool, cat, reduce, softmax, layernorm, batchnorm.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect a6b7ebb: the find_split_plan rank guard is gone (>4D index now lowers through the decompose-transfer affine.for peel, pixel_shuffle end-to-end), and the decompose-transfer peel <-> TOG incompatibility is resolved. Move it from Known-issues to Done; drop the >4D rank-guard caveat and the high-rank next-step.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Port mlir/test/lib/Analysis/TestDmaFineGrained.cpp to a Python out-of-line pass (passes/dma_fine_grained.py): split the matmul MVIN DMAs (input/weight/bias) into subtile affine.for nests and fuse the input/weight nests, replacing the C++ -dma-fine-grained pass. The MLIR Python bindings expose no IRMapping, so the fused nest is built directly (each DMA emitted with the fused induction vars) instead of cloning bodies -- structurally equivalent, not byte-exact SSA text.
Pipeline: the single mlir-opt invocation is split around the Python pass (loop-padding -> run_fine_grained in place -> pytorchsim-to-vcix) in both the functional and gem5 paths (extension_codecache); vectorlane (systolic-array size) is threaded in for the lane-banked SRAM offset rescale.
Validated against mlir-opt -dma-fine-grained on rank 2/3/4 fixtures (matmul / bmm / conv: same vcix dma_start and line counts) and end-to-end (Gem5+Spike+TOGSim): gemm/bmm/conv2d plus the resnet/transformer/vit/mlp models pass.
Docs: dma-transfer-lowering.md -- >4D peel is affine.for + lane-banked physical SRAM offset via the last index operand; dma_fine_grained / build_tog are now Python passes; the #258 appendix is marked resolved.
…ings)
Port mlir/test/lib/Conversion/PyTorchSimToVCIX/TestPyTorchSimToVCIXConversion.cpp to
a Python out-of-line pass (passes/lower_to_vcix.py): lower linalg.matmul (gemm and
conv2d) and the transcendental math ops (exp/erf/tanh/sin/cos) to VCIX dialect ops
(RISC-V vector custom instructions), replacing the C++ -test-pytorchsim-to-vcix.
The C++ pass is a dialect conversion (applyPartialConversion); the bindings expose no
conversion framework, so each matchAndRewrite is reimplemented as imperative IR
rewriting. The VCIX dialect is not in the Python bindings, so vcix ops are created as
unregistered generic ops -- mlir-opt / mlir-translate (vcix registered) re-parse the
{}-attr generic form fine, and run_standard_lowering already consumes vcix output via
allow_unregistered_dialects, so this matches the existing pipeline.
Pipeline: the vcix mlir-opt invocation is dropped; run_to_vcix runs in-process after
the Python fine-grained pass and before the standard lowering (both functional and
gem5 paths in extension_codecache). mlir-opt now runs only -test-loop-padding.
Validated structurally against mlir-opt -test-pytorchsim-to-vcix (non-constant ops
byte-identical including the dma_wait tag maps, on gemm and conv2d fixtures) and
numerically end-to-end (Gem5+Spike+TOGSim allclose): gemm/bmm/conv2d (incl. large
N/K), softmax, exp/erf/sin/cos, and the resnet18/vit/transformer/mlp models.
dma-fine-grained and pytorchsim-to-vcix are now Python passes (dma_fine_grained, lower_to_vcix); update the docstring listing -- only test-loop-padding still runs in mlir-opt.
This was referenced Jun 17, 2026
axis-split + graph-copy (on by default) linearize aligned floor/mod at the scheduling layer, so the index reaching get_dma_info is affine and the FloorDiv/ModularIndexing tile-divisibility branches there are never entered (measured: 0 entries across elementwise, gemm, bmm, conv, cat, floor/mod, reduce, attention).
Remove those dead branches and their orphans:
- the FloorDiv and ModularIndexing tile-forcing + RecompileSignal blocks
- the implicit-ModularIndexing index rewrite and implicit_local_dims
- the dead ModularIndexing branch in the dram_stride computation
- is_modular_indexing, the write-only implicit_dim_size, unused import sys
Kept: the non-floor/mod recompile paths (index-divisibility, indirect access, non-power-of-2 vec size), RecompileSignal, and the retry loop. The upstream implicit_dim_ops tile-forcing is left untouched (separate change).
Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d, group_conv, pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm, gqa -- all pass, 0 recompiles.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-split)
implicit_dim_ops/extract_dividers/apply_constraints forced the initial tile size to match a view's floor/mod divider, up front in compute_tile_size. axis-split now linearizes those views at the scheduling layer, so the forcing is redundant: disabling it leaves every test allclose-correct and, on the affected kernels, slightly faster (the forced tile was over-constrained -- batchnorm 1189->1114, layernorm 4092->3947 cycles; non-floor/mod kernels unchanged).
Remove the machinery and its now-unused imports (ModularIndexing, FloorDiv, Mod, MemoryDep, StarDep, WeakDep).
Validated end-to-end (Spike + TOGSim): elementwise, gemm, bmm, conv2d, group_conv, pool, cat, floor/mod suite, reduce, softmax, layernorm, batchnorm, gqa -- all pass, 0 recompiles.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the TPU layout-assignment & padding investigation (docs/tpu_layout_padding_report.md) and the loop-padding design doc. Settled model: padding is two layers -- (A) lane/sublane 8x128 alignment is materialized (footprint + DMA traffic), (B) the compute-block (MXU tile) boundary tail is masked (compute-utilization only, not traffic). test-loop-padding's post-codegen heuristic is to be replaced by informed emission at the scheduling/codegen layer (decide early, materialize late); the two costs must be modeled by separate functions (do not double-count the compute-block tail as traffic).
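The two-layer model can be written as two separate cost functions, which is the point of the "do not double-count" rule. A minimal sketch: the function names, the tile parameters, and the choice to measure utilization against the unpadded extents are all assumptions, not taken from the report.

```python
SUBLANE, LANE = 8, 128  # the 8x128 alignment named in the summary above


def round_up(x, multiple):
    """Smallest multiple of `multiple` that is >= x."""
    return -(-x // multiple) * multiple


def materialized_elems(rows, cols):
    """Layer A: elements actually stored and moved (footprint + DMA traffic)."""
    return round_up(rows, SUBLANE) * round_up(cols, LANE)


def compute_utilization(rows, cols, tile_rows, tile_cols):
    """Layer B: useful fraction of issued compute-block slots.

    The tail beyond (rows, cols) is masked, so it lowers utilization but
    adds nothing to the traffic reported by materialized_elems.
    """
    issued = round_up(rows, tile_rows) * round_up(cols, tile_cols)
    return (rows * cols) / issued
```

For a 10x200 operand with a hypothetical 128x128 compute block, layer A reports 16 * 256 = 4096 elements of traffic, while layer B reports 2000 / 32768 utilization. Feeding the 32768 figure into the traffic model would be the double-counting the doc warns against.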
…ases
Three fixes from the max-effort review of this branch:
- get_dma_info: after retiring the floor/mod recompile branches, a residual floor/mod (store-side ModularIndexing, reduction-axis floor/mod, incompatible radix) that axis-split/graph-copy did not linearize was silently bucketed by its base symbol in the dram_stride loop, emitting a wrong DRAM descriptor. Raise NotImplementedError instead of mis-striding silently. No test triggers it (0 floor/mod reach get_dma_info in the suite) -- it is a safety net.
- decompose_transfer collapse fast path: keep=[g[-1]] picked the last dim of each reassociation group, which is a unit dim when trailing unit dims attach after the non-unit one (e.g. [..,4,1,1]); strides/subtile were read from the wrong axis. Pick the non-unit dim in each group.
- decompose_transfer >4D peel: new_vlane fell back to 0 whenever the vlane split axis was not among the inner 4 dims, conflating peeled-into-the-outer-loop (genuinely unrepresentable -> raise) with a unit lane axis (default 0 is fine).
Validated: elementwise, gemm, conv2d, cat, floor/mod suite (incl. pixel_shuffle >4D peel), softmax, layernorm, batchnorm -- all pass, no spurious raise.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
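The collapse fast-path fix can be shown in isolation. This sketch uses a hypothetical helper, `pick_kept_dims`, and assumes each reassociation group is a list of source-dim indices; it is not the code from decompose_transfer.

```python
def pick_kept_dims(shape, groups):
    """For each reassociation group, return the dim whose stride/subtile to keep.

    A unit collapse has at most one non-unit dim per group; that dim carries
    the extent. If the whole group is unit dims, any member works, so the
    last one is used.
    """
    kept = []
    for group in groups:
        non_unit = [d for d in group if shape[d] != 1]
        if len(non_unit) > 1:
            raise ValueError(f"group {group} is not a unit collapse")
        kept.append(non_unit[0] if non_unit else group[-1])
    return kept


def pick_last_dims(groups):
    """The old rule, keep=[g[-1]], shown for comparison."""
    return [g[-1] for g in groups]
```

For shape `(2, 4, 1, 1)` with groups `[[0], [1, 2, 3]]`, the old rule keeps dims `[0, 3]` and reads the stride of a unit dim, while the fixed rule keeps `[0, 1]`.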
More fixes from the max-effort review, after verifying each against the C++ reference and reachability:
- graph_copy _relayout_args: ranges picked the consumer iteration shape by rank alone (max key=len), so for two equal-rank operands with different per-dim extents the broadcast-from operand's smaller shape could win and the real incompatible-radix conflict on the broadcast-to dim was missed (order-dependent: a commutative reorder flipped correct relayout into a silent miss). Use per-dim max extent over the max-rank operands.
- lower_to_vcix _sew/_legalize_vector_type: mirror the C++ legalizeVectorType -- F16/BF16 return sew 0 (transcendentals stay unlowered for -convert-math-to-llvm, as in the validated path) instead of being lowered to VCIX, and add the missing rank != 1 guard.
- lower_to_vcix matmul: port the C++ guards as loud failures -- M/N/K must be a multiple of the systolic size when > SS (else the N//SS / K//SS loops drop the tail tile), and A vs B must agree on the K subtile (last-writer-wins would pick one silently). Latent today (heuristic/autotune only emit SS-multiple tiles).
- Doc-only: graph-copy is default-on (TORCHSIM_GRAPH_COPY=0 to disable); fixed the two stale 'no-op unless set' comments.
Validated: elementwise, gemm, bmm, conv2d, group_conv, cat, floor/mod suite, softmax, layernorm -- all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
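The order dependence in the first fix is easy to reproduce with plain tuples. A sketch with hypothetical names (`ranges_by_rank`, `ranges_by_extent`), modeling operand shapes as tuples of per-dim extents:

```python
def ranges_by_rank(operand_shapes):
    """Old rule: the first operand of maximal rank wins, whatever its extents."""
    return max(operand_shapes, key=len)


def ranges_by_extent(operand_shapes):
    """Fixed rule: per-dim max extent over the operands of maximal rank."""
    max_rank = max(len(s) for s in operand_shapes)
    candidates = [s for s in operand_shapes if len(s) == max_rank]
    return tuple(max(dims) for dims in zip(*candidates))
```

Python's `max` returns the first maximal element, so with `[(1, 8), (4, 8)]` the old rule picks the broadcast-from shape `(1, 8)`, and swapping the operands changes the answer. The per-dim rule returns `(4, 8)` in either order.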
…fix exp chunk
Two fixes to the C++->Python vcix port (lower_to_vcix.py) that SDPA exercises but the gemm/bmm/conv tests do not:
- _lower_matmul bailed with 'if ATag is None or BTag is None: return False', gating on an MVIN dma_start tag for both operands. In SDPA's fused scores.V matmul, operand B is the softmax output produced in place by affine.vector_store, not DMAed, so BTag stayed None and the matmul was left un-lowered -> wrong attention output. Mirror the C++ MatmulOpLowering: an operand is initialized by either a dma_start OR a preceding affine.vector_store into its root memref; bail only when an operand is truly uninitialized. BTag/BAsync stay None/0 and are only read under 'if BAsync:', so the B dma_wait is correctly skipped (as in C++).
- _make_sf_vc_v_iv n>1 transcendental chunking called vector.ExtractStridedSliceOp(offsets, sizes, strides, vec) -- wrong arg order, missing the result type and vector operand, raising TypeError under these MLIR bindings. Pass (result=legal_ty, vector=vec, offsets, sizes, strides). Only reached by large transcendentals (n>1), e.g. SDPA softmax exp, so CI's small-tile (n==1) tests never hit it.
Validated end-to-end (Spike+TOGSim allclose): SDPA 56 cases pass (was crash/wrong); matmul/bmm/conv2d regress clean.
Bisected: C++ vcix passes SDPA, Python vcix did not; exp chunking and fine-grained ruled out separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bug env vars
axis-split and graph-copy are the floor/mod handling path and were default-on but still gated behind TORCHSIM_AXIS_SPLIT / TORCHSIM_GRAPH_COPY. Remove the gates so they run unconditionally, and delete the env vars that were only introduced for validation/debug during development:
  TORCHSIM_AXIS_SPLIT, TORCHSIM_GRAPH_COPY      - default-on toggles
  TORCHSIM_AXIS_SPLIT_FORCE                     - force-split validation aid
  TORCHSIM_AXIS_LEDGER + axis_split.ledger()    - coverage measurement
  TORCHSIM_DEBUG_AXIS_SPLIT + _dump_axis()      - debug dump
  TORCHSIM_GRAPH_COPY_DEBUG                     - graph-copy debug prints
  TORCHSIM_RECOMPILE_LOG                        - vestigial recompile log
Also drop the now-dead ledger() function, the _dump_axis() helper, and the unused os import in graph_copy.py. The floor/mod regression test no longer sets the removed env vars. Behavior is unchanged (the toggles were already on).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dma_fine_grained and lower_to_vcix already exposed run(module, **opts) like the registered decompose_transfer/lower_vlane_idx, but were called file-based and directly from extension_codecache, so the .mlir was parsed+printed twice (once per pass) between loop-padding and the standard lowering, and the pipeline was hardcoded+duplicated across the functional and gem5 paths.
Give both passes MARKERS and group the four rewrite passes into PRE_OPT_PASSES / POST_OPT_PASSES around the one remaining mlir-opt pass (-test-loop-padding). A single driver run_module_passes(in, out, passes, **opts) parses once, runs each marker-matched pass on the shared Module in order, prints once (copies through when no marker matches). run_python_passes is now PRE_OPT via that driver; the functional/gem5 fine-grained+vcix calls each become one run_module_passes. run_fine_grained / run_to_vcix stay re-exported for standalone/CLI use.
Validated (Spike+TOGSim): elementwise, gemm, conv2d, softmax, floor/mod suite, SDPA -- all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
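The parse-once / print-once driver shape can be sketched without the MLIR bindings. Everything here is a stand-in: `TextModule` replaces the parsed Module, `Pass` is a hypothetical record, and markers are assumed to be substrings checked once against the input text. Only the run_module_passes name and its (in, out, passes, **opts) signature come from the commit message.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Pass:
    name: str
    markers: Sequence[str]  # substrings whose presence means the pass has work to do
    run: Callable           # run(module, **opts), mutating the module in place


class TextModule:
    """Stand-in for a parsed module: a mutable list of lines."""

    def __init__(self, text):
        self.lines = text.splitlines()

    def __str__(self):
        return "\n".join(self.lines) + "\n"


def run_module_passes(in_path, out_path, passes, **opts):
    """Run the marker-matched passes on one shared module; return their names."""
    with open(in_path) as f:
        text = f.read()
    selected = [p for p in passes if any(m in text for m in p.markers)]
    if not selected:
        shutil.copyfile(in_path, out_path)  # nothing to do: copy through
        return []
    module = TextModule(text)  # parse once
    for p in selected:
        p.run(module, **opts)  # every pass sees the previous pass's edits
    with open(out_path, "w") as f:
        f.write(str(module))   # print once
    return [p.name for p in selected]
```

Checking markers against the input text means a pass whose marker is only produced by an earlier pass would be skipped in this sketch; the commit message does not say how the real driver handles that case.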
Summary
Moves the PyTorchSim MLIR codegen onto the MLIR Python bindings, ports the custom
C++ MLIR passes to Python, and enforces an affine-only DMA contract so codegen emits
only per-axis affine indices. After this PR the only C++ mlir-opt passes left in the
compile path are
-test-loop-padding and -test-pytorchsim-to-vcix.
Themes (10 commits):
- … coverage tooling, LLVM pin bump.
- … togsim.transfer and decompose to <=4D memref.dma_start; unify all DMA codegen on togsim.transfer; the >4D peel is an affine.for nest with the lane-banked physical SRAM offset (delivered as the last index operand), which also fixed #258 (decompose-transfer peel output (memref.subview + unrolled dma_start) incompatible with TOG generation).
- … scheduling layer (mixed-radix, default-on, rank guard removed) + graph-copy for operands axis-split can't linearize; coverage test test_floormod_axis_split.py.
- … -test-tile-operation-graph analysis to Python.
- … -dma-fine-grained pass to Python (matmul MVIN subtile loops + input/weight fusion), run in-process between loop-padding and vcix.
Validation
- … mlir-opt -dma-fine-grained on rank 2/3/4 (matmul / bmm / conv): same vcix dma_start and line counts.
- … resnet / transformer / vit / mlp / mobilenet / llama models.
Notes
- … -test-loop-padding and -test-pytorchsim-to-vcix; drop the now-dead C++ pass sources from the fork.
🤖 Generated with Claude Code